feat(cloudtrail-alerts): per-detector exclusions + 8 new detectors + SSO MFA fix#277
Open
Cre-eD wants to merge 4 commits into
Open
feat(cloudtrail-alerts): per-detector exclusions + 8 new detectors + SSO MFA fix#277Cre-eD wants to merge 4 commits into
Cre-eD wants to merge 4 commits into
Conversation
…SSO MFA fix Why: production deployment exposed alert-fatigue patterns — ~16 fires/day with ~89% false positives from governed automation (Pulumi CI bot, Prowler scanner, AWS Identity Center sessions, AWS service-linked roles). Built-in detectors matched the CIS CloudWatch.1-14 set verbatim, leaving no knob to tune. Plugin extensions: - `CloudTrailAlertOverride` schema with optional per-detector exclusions (userName / principalId / arn / arn-glob / userType / invokedBy) baked into the CloudWatch metric filter pattern at provision time. Suppression happens at the metric layer rather than in the Lambda enrichment step so excluded events never increment the metric, the alarm never trips, and the CW alarm dashboard reflects real signal (preserves SOC2 CC6.1 / ISO 27001 audit clarity). Threshold / period / evaluationPeriods overrides too. - `consoleLoginWithoutMfa` now scoped to `userIdentity.type = "IAMUser"`. AWS Identity Center / federated console sessions always emit `MFAUsed = "No"` because MFA happens upstream at the IdP — without the scope, every SSO console session triggered a CloudWatch.3 false positive. Identity Center coverage belongs in a separate detector against `signin.amazonaws.com / UserAuthentication` (not added here). - 8 beyond-CIS detectors covering attacker-blinding moves and exposure paths that CIS does not address: GuardDuty disable, Security Hub disable, IAM access key creation, S3 Block Public Access toggle, Lambda Function URL with AuthType=NONE, KMS key policy / grant changes, AWS Organizations SCP churn, anonymous external probes (`userIdentity.type=AWSAccount` AccessDenied with default threshold=10 to require sustained reconnaissance, not page on one-off probes). Backwards compat: plain-`bool` selector form is unchanged. Existing consumers who declare `iamPolicyChanges: true` see identical filter patterns and alarm parameters. New struct fields and `overrides` map are opt-in. Tests: 14 new test cases covering exclusion clause generation, deterministic ordering across YAML re-orderings, threshold / period override application, empty-string skipping, and SSO MFA scope. Existing tests updated for totalDetectors = 22 (14 CIS + 8 additions). Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Semgrep Scan ResultsRepository:
Scanned at 2026-05-20 11:43 UTC |
Security Scan ResultsRepository:
Scanned at 2026-05-20 11:44 UTC |
Cross-model review of PR #277 (codex + gemini, both fact-checked against AWS primary docs) surfaced 6 correctness bugs that would have shipped to prod. Each is fixed here with a test that asserts the desired behavior. 1. NOT EXISTS guard on every exclusion clause. CloudWatch metric-filter `$.field != "x"` returns FALSE when the field is absent, not TRUE. Without a NOT EXISTS guard, an unguarded exclusion like `$.userIdentity.userName != "bot"` would silently drop every event whose userIdentity lacks the field at the top level — AssumedRole events in particular carry userName at $.userIdentity.sessionContext.sessionIssuer .userName, NOT at $.userIdentity.userName. The unguarded form would have dropped the ENTIRE detector for every assumed-role principal, not just the bot we meant to exclude. buildExclusionClauses now generates: (($.field NOT EXISTS) || ($.field != "v1")) # single value (($.field NOT EXISTS) || (($.field != "v1") && ($.field != "v2"))) # multi-value Multi-value uses inner-AND (De Morgan'd from "NOT (v1 OR v2)"). New tests: TestApplyOverride_Exclusions, TestApplyOverride_NotExistsGuard_MultipleValues, TestApplyOverride_MultipleExclusionFields, TestApplyOverride_DeDupesValues. Ref: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html Ref: https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-user-identity.html 2. Remove PutResourcePolicy from kmsKeyPolicyChanges. PutResourcePolicy is not a KMS API. (It exists on CloudTrail Lake and a few other services, but never under eventSource=kms.amazonaws.com, so the clause was permanent dead code.) KMS uses PutKeyPolicy for resource policy edits. Comment updated to record why this is excluded to prevent re-introduction. Ref: https://docs.aws.amazon.com/kms/latest/APIReference/API_Operations.html 3. Expand organizationsChanges to cover OU + delegated admin moves. Added MoveAccount, RemoveAccountFromOrganization, RegisterDelegatedAdministrator, DeregisterDelegatedAdministrator, EnableAWSServiceAccess, DisableAWSServiceAccess. RegisterDelegatedAdministrator is the canonical "blind the management account" move: delegate GuardDuty/Security Hub admin to a compromised member, then suppress findings there. Ref: https://docs.aws.amazon.com/organizations/latest/APIReference/API_Operations.html 4. Add Root to consoleLoginWithoutMfa scope. Earlier fix scoped to userIdentity.type=IAMUser to silence the AWS Identity Center false positive. That silently dropped Root console logins (Root type is "Root", not "IAMUser"). rootAccountUsage detector catches ALL Root activity including MFA-protected sessions; this one specifically surfaces Root-without-MFA for triage. Scope expanded to ($.userIdentity.type = "IAMUser" || = "Root"). 5. Remove DeleteInsight from securityHubDisabled. Insights are saved dashboard searches, not detectors. Deleting one is a UI/ housekeeping action, not a "visibility lost" event — pure noise. 6. Fail-fast validation of overrides keys + reflection test for wireup. - New `validateOverrides()` checked at provision time: rejects map keys that don't correspond to any detector. Catches YAML typos like `overrides: { unauthorizedApiCall: ... }` (missing trailing s) at deploy time with a list of valid keys, instead of silently dropping the override and leaving the operator wondering why their exclusion didn't take. - selectorChecks extracted to a single source of truth so reflection test can assert bidirectional consistency: every selector bool maps to a securityAlerts entry and vice versa, count == totalDetectors. Catches the regression where a contributor adds a detector but forgets the wireup. New tests: TestValidateOverrides_Empty, TestValidateOverrides_KnownKey, TestValidateOverrides_UnknownKeyIsLoudError, TestSelectorChecksWireUpAllDetectors, TestApplyOverride_WorstCaseBasePattern (covers leading whitespace + internal OR-groups in base filter pattern under the trim-and-rewrap path). Test summary: 17 tests now pass (was 14 in the first push of PR #277). Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Contributor
Author
Round 2: review-driven correctness fixesFollowup commit Fixed in this PR
Wontfix (filed in head — recorded as follow-ups, not blockers)
Tests17 tests pass (was 14). |
Merged
4 tasks
Cre-eD
added a commit
that referenced
this pull request
May 20, 2026
…IEW_BRANCH (+SHA pin) (#278) ## Why Production users testing a feature-branch SC build via `sc.sh` are blocked by the Phase 2c failgate: the cert-identity regex passed to `cosign verify-blob` is hard-pinned to `push.yaml@refs/heads/main`. Preview tarballs published by `branch-preview.yaml` are legitimately Sigstore-signed but cannot install. The only workarounds today are `curl + tar -xz` (bypasses signature verification entirely) or fork sc.sh locally. Both are worse than a documented, narrowly-scoped opt-in. Discovered while validating the CloudTrail security alerts plugin (PR #277) against a preview build of that branch. ## What ### `sc.sh` — branch-pinned (and optionally SHA-pinned) opt-in ```bash # Minimum (branch-pinned): SIMPLE_CONTAINER_TRUST_PREVIEW_BRANCH=feat/your-branch \ SIMPLE_CONTAINER_VERSION=YYYY.M.D-pre.<sha>-preview.<sha> \ bash <(curl -Ls https://dist.simple-container.com/sc.sh) # Recommended for CI (branch + commit SHA pin): SIMPLE_CONTAINER_TRUST_PREVIEW_BRANCH=feat/your-branch \ SIMPLE_CONTAINER_TRUST_PREVIEW_SHA=4cc1a03ca5c259a428e07d4f0bb8eb9120a6e2b7 \ SIMPLE_CONTAINER_VERSION=YYYY.M.D-pre.<sha>-preview.<sha> \ bash <(curl -Ls https://dist.simple-container.com/sc.sh) ``` `verify_sc_tarball` widens the accepted identity regex to ALSO accept `branch-preview.yaml@refs/heads/<that exact branch>`. The branch name is allowlist-validated (`^[A-Za-z0-9._/-]+$`, no `..` / leading-trailing-slash / `.lock`-suffix) and the `.` regex metachar is escaped before interpolation, so a value like `evil.*` or `feat.foo` cannot re-open the permissive trust set. When `_SHA` is set, cosign is given `--certificate-github-workflow-sha` which verifies against the Sigstore certificate's GitHub OIDC `workflow_sha` claim (OID `1.3.6.1.4.1.57264.1.3`) — pinning verification to a specific commit, not the mutable branch head. ### `sc.sh` — rejects the over-permissive `=1` form outright The prior commit had `SIMPLE_CONTAINER_ALLOW_PREVIEW=1` (any branch). Codex + Gemini both flagged that as a meaningful trust expansion — anyone with push access to any branch in the repo could ship a tarball that the regex would accept once a user opted in for "their" branch. Now `=1` fails closed with a pointer to `_BRANCH`. ### `sc.sh` — loud stderr warning + smarter error messages - Every preview install emits a multi-line stderr warning listing the trusted branch + whether SHA pin is in effect. Mitigates the "persistent export in shell rc silently relaxes trust" footgun. - On verification failure in strict mode, the error message detects the preview-signed case, parses the signer branch out of cosign's `got subjects [...]` line, and offers it back as a copy-paste env-var value. ### `docs/SECURITY.md` — documentation New subsection under "Verifying tarballs" documenting: - `SIMPLE_CONTAINER_TRUST_PREVIEW_BRANCH` + `SIMPLE_CONTAINER_TRUST_PREVIEW_SHA` usage - Why "trust all preview" is intentionally not supported - Manual `cosign verify-blob` equivalent for preview tarballs (branch + SHA pin) ## Security analysis Threats addressed: | Threat | Fix | |---|---| | Any-branch trust | `_BRANCH` required; `=1` rejected; branch allowlist + regex escape | | Workflow file content not pinned | Optional `_SHA` via `--certificate-github-workflow-sha` (verifies OID 1.3.6.1.4.1.57264.1.3) | | Persistent env var = ambient policy | Loud stderr warning on every preview install; `_BRANCH` value must change per preview, narrowing the persistence window | | CDN-account compromise serves different-branch tarball | Branch pin neutralizes — different branch's tarball at the requested URL fails verification | | CDN serves old commit of the same branch | SHA pin (recommended) neutralizes | | Regex injection / identity shadowing | Branch allowlist + escape of `.` metachar | Threats explicitly out of scope (separate concerns, documented in commit message): - Same CDN namespace (publish side, not verify side) - Installed `sc --version` vs requested interlock (defense-in-depth, separate hardening pass) - Path-traversal-safe extraction (pre-existing concern in extract logic) ## Testing 8 manual test cases against the live preview tarball `v2026.5.26-pre.4cc1a03-preview.4cc1a03`: | # | Setup | Expected | Result | |---|---|---|---| | A | no env vars | strict reject + auto-extracted hint | ✅ | | B | `ALLOW_PREVIEW=1` | fail-fast before cosign | ✅ | | C | `_BRANCH=evil.*` | validation reject | ✅ | | D | `_BRANCH=feat/wrong` (valid format, doesn't match signer) | cosign mismatch | ✅ | | E | correct `_BRANCH` only | success + warning | ✅ | | F | correct `_BRANCH` + correct `_SHA` | success + SHA-pinned warning | ✅ | | G | correct `_BRANCH` + wrong `_SHA` (40-hex zeros) | cosign rejects | ✅ | | H | correct `_BRANCH` + invalid `_SHA` (`not-a-sha`) | validation reject | ✅ | Semgrep `shell-curl-pipe-to-shell` rule: 0 findings (curl|sh example was removed from error text). ## Test plan - [x] Manual: all 8 paths verified locally - [x] Semgrep CI clean (was failing on the previous commit — fixed by removing the curl|sh example string from error text) - [ ] CI: branch-preview.yaml run on this fix branch confirms the rebuilt sc.sh still works for production installs - [ ] After merge + a push.yaml release: downstream consumers (integrail/devops install-sc, PAY-SPACE wrappers) gain the opt-in path automatically; PR #277's preview build becomes installable via `_BRANCH` + `_SHA` pin ## Refs - PR #277 — the trigger for needing this; its preview build is the first artifact unblocked by this change - Round-2 review comment with full triage: #278 (comment) --------- Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Round-3 polish driven by 14-day production data from everworker (n=232 fires)
plus a 107-min post-apply observation window. Two changes — both validated
against codex + gemini security review.
## Change 1: split kmsKeyPolicyChanges → kmsKeyPolicy + kmsKeyGrants
Production observation: in 107 min post-apply, integrail-deployer-bot
(the Pulumi CI principal) generated 27 CreateGrant + 19 RetireGrant events
on KMS — natural deploy churn (every encrypted resource needs a grant).
PutKeyPolicy events: 0. The merged detector with default threshold 1 would
either flood the security-critical channel on every deploy OR mask a real
PutKeyPolicy event under a high threshold tuned for Pulumi noise.
Split into two detectors with separate signal models:
- kmsKeyPolicy (default ON, threshold 1)
- Matches: PutKeyPolicy
- "Who can use this key" is a structural security-boundary change. Rare
even in active orgs; page on any occurrence.
- kmsKeyGrants (default OFF, threshold 10/300s)
- Matches: CreateGrant / RetireGrant / RevokeGrant
- High-volume in IaC-driven environments. Default off so most consumers
won't see noise; opt-in for orgs with specific audit needs, with the
expectation they'll add a principal exclusion override.
Backwards-compat: the kmsKeyPolicyChanges selector is GONE (not aliased).
Consumers using the old name will get a loud `validateOverrides` error at
deploy time pointing at the new keys. SC #277 is not yet released so the
BC break has zero downstream impact.
Codex + gemini both recommended split; aliasing was discussed and rejected
since there are no existing consumers to protect.
## Change 2: tighten unauthorizedApiCalls default
Two changes to the base filter + default threshold:
a) Default-exclude userIdentity.type = "AWSService"
60% of historical AccessDenied events in our production sample were AWS
service-linked roles (Macie, AWS Config, ResourceExplorer, Inspector)
probing optional services for capability discovery. Zero security
value: an attacker who compromises a service-linked role surfaces as
AssumedRole in CloudTrail, not AWSService (service-linked role sessions
never directly produce customer-side API calls in attacker scenarios).
Excluding at base saves every downstream consumer from re-implementing
the same override.
b) Default-exclude userIdentity.type = "AWSAccount"
AWSAccount-type AccessDenied events are covered by the dedicated
anonymousProbes detector (same threshold 10/300s). Making the two
detectors disjoint avoids double-paging on the same event class in
the high-volume case. Low-volume cross-account access (stale trust,
third-party integration failure) was below threshold in either
detector anyway, so no signal is lost.
c) Raise default threshold from 5 to 10
CIS Benchmark "1 AccessDenied = alert" is too noisy for any active
AWS account. 5 was an arbitrary middle-ground in the original PR;
production data showed 10/300s absorbs natural permission-evaluation
noise (eventual consistency, mistyped CLI commands, optional-service
self-probes after the AWSService exclude) without missing real bursts.
Aligned with anonymousProbes' threshold for symmetry.
NOT EXISTS guard preserved on both type clauses so events where
$.userIdentity.type is absent don't get silently dropped.
## Tests (28 total, was 23)
Five new tests lock in the new defaults:
- TestUnauthorizedApiCalls_DefaultThresholdIsTen
- TestUnauthorizedApiCalls_DefaultExcludesAWSService
- TestUnauthorizedApiCalls_DefaultExcludesAWSAccount
- TestKmsKeyPolicy_HighSignalDefault (PutKeyPolicy only, no grants)
- TestKmsKeyGrants_HighVolumeDefault (grants only, threshold 10, no PutKeyPolicy)
- TestKmsKeyPolicyChanges_OldNameRemoved (validateOverrides catches old YAML)
totalDetectors bumped 22 → 23 (kms split). Reflection test asserts every
bool selector wires through enabledAlerts.
## What consumers need to update
YAML (`.sc/stacks/<stack>/server.yaml`):
OLD: NEW:
kmsKeyPolicyChanges: true kmsKeyPolicy: true # default-on, page on PutKeyPolicy
# kmsKeyGrants: false # opt-in for grant-lifecycle audit
overrides: overrides:
unauthorizedApiCalls: unauthorizedApiCalls:
threshold: 10 # remove — plugin default is now 10
excludeUserTypes: ["AWSService"] # remove — plugin default now excludes AWSService + AWSAccount
Migration captured in the devops PR #428 update.
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Production deployment in
everworkerexposed the alert-fatigue profile the plugin produces today: ~16 fires/day, ~89% false positives. Over 14 days of CloudWatch alarm history:unauthorizedApiCallsGetFunctionUrlConfig(only real signal in the bucket).iamPolicyChangesintegrail-deployer-bot(Pulumi CI from Hetzner runner IPs)securityGroupChangesintegrail-deployer-bot(Pulumi CI deploy churn)s3BucketPolicyChangesintegrail-deployer-bot+ 6%github-actions-pulumi-githubconsoleLoginWithoutMfaThe built-in detectors matched the CIS CloudWatch.1-14 set verbatim and exposed only
booltoggles, so the only tuning options were "leave the noise on" or "disable a CIS control entirely." This PR adds knobs.What
1. Per-detector overrides
New
CloudTrailAlertOverridestruct, keyed by detector name inselectors.overrides:Exclusion clauses are baked into the CloudWatch metric filter pattern at provision time (suppression at source — excluded events never increment the metric, the alarm never trips, the Lambda is never invoked, the CW alarm dashboard reflects real signal). Threshold / period / evaluationPeriods overrides are exposed too.
Architecture choice rationale (cross-checked with codex + gemini): source-side suppression over runtime suppression because:
2. SSO MFA false positive fix
consoleLoginWithoutMfais now scoped touserIdentity.type = "IAMUser". AWS Identity Center / federated console sessions always emitConsoleLoginwithadditionalEventData.MFAUsed = "No"because MFA is enforced at the IdP, not at the AWS console step. Without this scope every SSO admin login triggered a CloudWatch.3 alert. Identity Center coverage belongs in a separate detector againstsignin.amazonaws.com / UserAuthenticationevents (not added here — different signal, different threshold model).References:
ConsoleLogincan showMFAUsed=No.UserAuthenticationis the canonical SSO sign-in event.3. Eight beyond-CIS detectors
Default off (opt-in). Cover attacker-blinding moves and exposure paths CIS CloudWatch.1-14 does not address:
guardDutyDisabledDeleteDetector,UpdateDetector,Disassociate/Delete/StopMonitoringMemberssecurityHubDisabledDisableSecurityHub,BatchDisableStandards,DisableImportFindingsForProduct,DeleteActionTarget,DeleteInsightaccessKeyCreationCreateAccessKeys3PublicAccessChangesPut/DeleteAccountPublicAccessBlock,Put/DeleteBucketPublicAccessBlocklambdaUrlPublicCreateFunctionUrlConfig/UpdateFunctionUrlConfigwithauthType = "NONE"kmsKeyPolicyChangesPutKeyPolicy,PutResourcePolicy,CreateGrant,Retire/RevokeGrantorganizationsChangesEnable/DisablePolicyType+LeaveOrganization+RemoveAccountFromOrganizationanonymousProbesuserIdentity.type=AWSAccountAccessDenied probes, threshold 10/5minBackwards compatibility
boolselector form is unchanged. Existing consumers see identical filter patterns + alarm parameters.overridesmap is optional; zero-valueCloudTrailAlertOverrideis a no-op.Tests
Existing tests updated for
totalDetectors = 22. 9 new test cases:TestApplyOverride_Exclusions— single exclusion fieldTestApplyOverride_MultipleExclusionFields— all 6 exclusion fields togetherTestApplyOverride_Deterministic— YAML re-ordering produces identical filter pattern (no Pulumi churn)TestApplyOverride_ThresholdAndPeriod— non-zero overrides appliedTestApplyOverride_EmptyOverrideIsNoop— zero value is no-opTestApplyOverride_EmptyStringsSkipped—["", "real-bot", ""]produces one clause, not threeTestEnabledAlerts_OverrideApplied— overrides reachenabledAlertsTestConsoleLoginWithoutMfa_RestrictedToIAMUser— locks down the SSO FP fixTestAnonymousProbes_DefaultThreshold— locks down the threshold=10 defaultTest plan
go test ./pkg/clouds/aws/... ./pkg/clouds/pulumi/aws/...passesgo vet ./pkg/clouds/aws/... ./pkg/clouds/pulumi/aws/...cleanintegrail/devopsPR uses the new fields against a live preview environment; verify alarms tripping → metric filter no longer matches Pulumi CI events → alarm history shows the gapeverworkerprod; re-measure 14d fire rate, target ≤2/week